Transcribing Broadcast News: The LIMSI Nov96 Hub4 System
Abstract
In this paper we report on the LIMSI Nov96 Hub4 system for transcription of broadcast news shows. We describe the development work in moving from laboratory read speech data to real-world speech data in order to build a system for the ARPA Nov96 evaluation. Two main problems were addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The speech recognizer uses continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on large text corpora for language modeling. The base acoustic models were trained on the WSJ0/WSJ1 corpus and adapted using MAP estimation with 35 hours of transcribed task-specific training data. The 65k-word language models were trained on 160 million words of newspaper texts and 132 million words of broadcast news transcriptions. The problem of segmenting the continuous stream of data was investigated using 10 MarketPlace shows. The overall word error rate on the Nov96 partitioned evaluation test data was 27.1%.
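The word error figure quoted in the abstract is conventionally the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The following is a minimal sketch of that calculation, assuming whitespace tokenization and unit cost for every error type. The function name `word_error_rate` and the example sentences are illustrative only; this is not the scoring tool used in the evaluation, which also applies text normalization that is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Assumes whitespace-separated words and equal cost for each error type.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy example: one substitution and one insertion against 4 reference words.
    print(word_error_rate("the news at nine", "the views at nine tonight"))
```

Because insertions count as errors but do not lengthen the reference, the rate computed this way can exceed 100% for very poor hypotheses.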
Similar resources
Transcribing broadcast news shows
While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to r...
The LIMSI Broadcast News transcription system
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read speech data to real-world or ‘found’ speech data in preparation for the ARPA Nov96, Nov97 and Nov98 evaluations. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These...
Recent advances in transcribing television and radio broadcasts
Transcription of broadcast news shows (radio and television) is a major step in developing automatic tools for indexation and retrieval of the vast amounts of information generated on a daily basis. Broadcast shows are challenging to transcribe as they consist of a continuous data stream with segments of different linguistic and acoustic natures. Transcribing such data requires addressing two m...
Transcription and indexation of broadcast data
In this paper we report on recent research on transcribing and indexing broadcast news data for information retrieval purposes. The system described here combines an adapted version of the LIMSI 1998 Hub-4E transcription system for speech recognition with text-based IR methods. Experimental results are reported in terms of recognition word error rate and mean average precision for both the TREC ...
Publication date: 1997